Identify “Unknown Unknowns” with AI-Driven Data Quality
July 24, 2024
In 2002, Secretary of Defense Donald Rumsfeld made the concept of “unknown unknowns” famous. The idea is that the most significant risks come from knowledge gaps you aren’t even aware of.
The concept applies directly to data quality. By sorting possible issues according to whether you’re aware of them (or not) and whether you understand them (or not), you can ensure comprehensive coverage that goes well beyond tests, which cover only the known knowns. Let’s get to know this framework, and how Anomalo covers all parts of it.
The quadrants of knowledge
The four quadrants of knowledge were first proposed in the 1950s as the Johari window, a cognitive psychology framework for identifying what you and your colleagues know and don’t know about you.
On one axis is awareness, which is what you know about. On the other is understanding, that is, what you’re currently able to grasp in detail. Awareness is knowing a book exists; understanding is grasping what the book is saying. In this framework, awareness and understanding are always specified in that order, so a “known unknown” is something that you’re aware of but don’t understand.
Let’s take a look at how you could use these quadrants to map out uncertainty around an upcoming flight:
- Known knowns: what you’re aware of and that you understand (e.g., the scheduled time of a flight)
- Known unknowns: what you know to pay attention to but don’t yet understand the details of (e.g., the chance of a delay caused by mechanical issues)
- Unknown knowns: understanding that you have but that you aren’t considering (e.g., you aren’t paying attention to the bad weather, but if it were pointed out to you, you would understand that your flight could be delayed because of it)
- Unknown unknowns: what you neither understand nor are considering (e.g., there could be a mouse running loose on the plane)
Let’s break down how to apply this matrix to data quality, and how Anomalo addresses each quadrant.
Anomalo’s quadrants of knowledge. The platform offers tools to address all four of them, with AI covering the trickiest: unknown unknowns.
Known knowns: rules
Traditional data quality rules fall into this first quadrant. If you understand the potential issues and exactly how to identify bad data, you can write rules.
There are two big problems with relying exclusively on rules in a modern data environment. First, they require manual effort to create, test, and maintain, so they don’t scale well. Second, they only cover the known knowns, so they can create a false sense of security while leaving you exposed to every other quadrant.
That said, when you know exactly what you need—say, “sound an alert if a certain column has new null values”—rules give you immediate confidence. Rules are also the only way for a subject matter expert to articulate what they expect the data to look like, and ensure that the data always conforms to that expectation.
Anomalo example: You can set rules with specific parameters, either by using our no-code interface or by writing your own SQL. These validation checks flag specific failures and determine whether new or historical data conforms to known standards. As with all of Anomalo’s checks, you can specify who should be notified and how the alerts should be sent.
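To make the idea of a rule concrete, here is a minimal Python sketch of the null-value check mentioned above. It is not Anomalo’s interface or API; the function names, the `emails` column, and the threshold are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def null_rate(values: Sequence[Optional[object]]) -> float:
    """Return the fraction of entries that are None (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def check_null_rate(values: Sequence[Optional[object]], max_null_rate: float = 0.0) -> bool:
    """The rule passes when the null rate does not exceed the allowed threshold."""
    return null_rate(values) <= max_null_rate


# Hypothetical column of customer emails from today's load.
emails = ["a@example.com", None, "c@example.com", "d@example.com"]

# With the default threshold of 0.0, a single null is enough to fail the rule.
rule_passed = check_null_rate(emails)
```

A rule like this is explicit and easy to reason about, which is also why it only catches the exact failure it was written for.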
Known unknowns: metrics
This quadrant covers areas where you know issues could arise but would struggle to articulate exactly how to catch them with rules.
Let’s say you’re the analyst at an e-commerce company. The size of a customer’s first order can vary significantly depending on many factors, including the acquisition channel (paid search, organic, social, and so on).
Depending on the marketing team’s ad buys, seasonality, trends, and many other factors, sales numbers could vary quite a bit, especially for individual items or categories.
With so many factors whose impacts can shift over time, hard-and-fast rules are costly and brittle. Instead, it makes sense to evaluate the numbers with the data-quality equivalent of a sniff test in this case: does this fit the pattern I’d expect?
Depending on the circumstance, you could do this manually by inspecting the data directly or by looking at aggregates. However, at enterprise scale, it’s more efficient and reliable to lean on machine learning to pay attention to a certain statistic, such as the average or sum of a column, and how that statistic changes over time.
Anomalo example: As the e-commerce analyst, you could instruct Anomalo to pay attention to the average order value by source. If there’s a big drop in, say, sales from social, you might discover a dropped attribution tag or some other issue to be fixed. (Learn more in our blog series “Monitoring Metrics.”)
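As a rough illustration of watching a statistic over time, the sketch below compares the latest average order value for each source against its own recent history. It uses a simple z-score as a stand-in for the machine learning described above and is not how Anomalo computes its metrics; the data, the `is_anomalous` helper, and the threshold are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` when it sits more than z_threshold standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical daily average order value (in dollars) per acquisition channel.
# The last entry in each list is today's value.
daily_aov = {
    "paid_search": [52.0, 49.5, 51.0, 50.5, 48.0, 50.0],
    "social": [31.0, 29.5, 30.0, 30.5, 29.0, 12.0],
}

# Compare today's value with the preceding days, one channel at a time.
alerts = {
    source: is_anomalous(series[:-1], series[-1])
    for source, series in daily_aov.items()
}
```

Here only the social channel is flagged, because its latest value falls far outside its own recent range. A production system would also need to account for seasonality and trend, which a plain z-score ignores.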
Unknown knowns: profiling
This is the most beguiling quadrant. What does it mean not to know what you know?
From a data quality perspective, unknown knowns are issues you aren’t looking for, but that you’d spot if they were presented to you. That’s where profiling comes in. Using calculations and visualizations, subject-matter experts identify where tables and data flows have gone wrong. This approach can help find systemic issues such as missing data in key columns, invalid data formats, or invalid values.
Anomalo example: Anomalo’s column profile view visualizes the composition of each column in a table, so if you’re familiar with the data you can see at a glance if, say, a typically heterogeneous column is unexpectedly showing less variety. Relatedly, root cause analysis surfaces the segments (rows, columns, or values) that are overrepresented in identified faults in an easy-to-read graph. This turns the unknown knowns into actionable insights, making it much easier to track and fix the issue at hand.
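The kind of summary a column profile might present can be sketched in a few lines of Python. This is an illustrative approximation, not Anomalo’s column profile view; the `profile_column` helper and the `country` data are hypothetical.

```python
from collections import Counter


def profile_column(values, top_n=3):
    """Summarize a column: row count, null fraction, distinct count, and most common values."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    total = len(values)
    return {
        "rows": total,
        "null_fraction": (total - len(non_null)) / total if total else 0.0,
        "distinct": len(counts),
        "top_values": counts.most_common(top_n),
    }


# Hypothetical `country` column that normally holds many different values.
country = ["US", "US", "US", "US", None, "US", "US", "CA"]
summary = profile_column(country)
```

Someone familiar with this table would see from `summary` that only two distinct countries appear and could judge right away whether that lack of variety is a problem.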
Unknown unknowns: AI monitoring
This is the quadrant where Anomalo’s AI-powered continual monitoring shines.
By definition, you cannot predict what unknown unknowns might be. The traditional approach is simply whack-a-mole: as soon as you’re aware that something is going wrong, start investigating where it came from, and cross your fingers you can fix it before things get out of hand.
Unfortunately, this means you’re always playing catch-up, and sometimes it’s too late. In 2022, a coding error at Equifax went unidentified and unfixed for three weeks, affecting millions of consumers’ credit scores and thousands of credit decisions. In 1999, NASA lost its Mars Climate Orbiter because of mismatched units of measurement.
The good news is, you can be proactive even if you don’t know what problems might emerge. Anomalo’s data quality software continually monitors datasets to learn patterns and detect when data strays too far from the pattern. Not only can it look at basic factors, such as the range of data in a column, but it can also learn and monitor complex relationships, such as the values in column B that are most likely seen with a certain value in column D. Anomalo’s AI goes beyond basic observability by scrutinizing the patterns and relationships of the data itself.
The amazing thing is that Anomalo does this on its own. It can monitor any number of tables, giving everyone from executives to analysts the confidence that they’re basing decisions on sound data regardless of which table it comes from.
Anomalo example: Turning on automated data quality is as simple as toggling a switch. Within a few weeks Anomalo will learn enough about your data to be useful, and within a few months it will be capable of flagging the most important unknown unknowns.
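To give a feel for what monitoring a relationship between two columns could mean, the sketch below learns which values of column B usually appear with each value of column D, then measures how many new rows break that pattern. It is a deliberately simple frequency count, not Anomalo’s AI; the function names, the currency and country data, and the `min_prob` cutoff are all hypothetical.

```python
from collections import Counter, defaultdict


def learn_pairs(rows):
    """Count how often each value of column B appears alongside each value of column D."""
    seen = defaultdict(Counter)
    for b, d in rows:
        seen[d][b] += 1
    return seen


def unusual_share(seen, new_rows, min_prob=0.05):
    """Fraction of new rows whose B value was rare or unseen for its D value in the baseline."""
    if not new_rows:
        return 0.0
    unusual = 0
    for b, d in new_rows:
        counts = seen.get(d, Counter())
        total = sum(counts.values())
        prob = counts[b] / total if total else 0.0
        if prob < min_prob:
            unusual += 1
    return unusual / len(new_rows)


# Hypothetical baseline rows of (currency, country): B is currency, D is country.
baseline = [("USD", "US")] * 50 + [("CAD", "CA")] * 50
model = learn_pairs(baseline)

# Hypothetical new batch in which some Canadian orders arrive priced in USD.
new_rows = [("USD", "US")] * 8 + [("USD", "CA")] * 2
share = unusual_share(model, new_rows)
```

Nobody wrote a rule saying Canadian orders should be in CAD; the pattern was learned from the baseline, which is the point of this quadrant.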
Reducing unknown unknowns is now a known known
There is no silver bullet that addresses all data quality monitoring needs. By combining these four approaches—rules for the known knowns, metrics for the known unknowns, profiling for the unknown knowns, and AI for the unknown unknowns—you can comprehensively cover all knowable data quality issues.
Having read this article, you can’t unknow that AI is the key to catching issues you couldn’t have anticipated, alongside the other data quality tools that Anomalo offers. Customers from Discover Financial to Keller Williams to BuzzFeed use Anomalo to catch issues before they affect their operations, revenue, and reputation. They know, and now you do too. So let’s get to know each other: contact our sales team today.